Abstract


Span-based joint extraction models have proven effective for entity recognition and relation
extraction. These models regard text spans as candidate entities and span tuples as candidate
relation tuples. Span semantic representations are shared between entity recognition and relation
extraction, but existing models do not capture the semantics of these candidate entities and
relations well. To address this problem, we introduce a span-based joint extraction framework with
attention-based semantic representations. Specifically, attention is used to calculate semantic
representations, including span-specific and contextual ones. We further investigate the effects
of four attention variants in generating contextual semantic representations. Experiments show
that our model outperforms previous systems and achieves state-of-the-art results on ACE2005,
CoNLL2004 and ADE.


1 Introduction 


This paper considers intra-sentence joint entity and relation extraction. Because joint extraction can
alleviate cascading errors and make better use of shared information than pipelined extraction, it has
drawn much attention. Typically, the joint extraction task is solved by sequence tagging based methods
(Zheng et al., 2017; Chi et al., 2019).

Rather than using sequence tagging based methods, recent works attempt to solve the task in a
span-based joint extraction mode (Dixit and Al-Onaizan, 2019; Luan et al., 2019). Typically, this mode first
processes the sentence text into text spans, which are span-based candidate entities ("spans"
for short); then calculates span semantic representations and performs span classification on them to
obtain predicted entities; next forms span-based candidate relation ("relation" for short)
tuples from the spans, and calculates relation semantic representations from the corresponding span semantic
representations; finally, it performs relation classification on the relation semantic representations and obtains
predicted relation triples. This mode further improves joint extraction performance, but three
problems remain.
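
The first step of this pipeline, turning a sentence into candidate spans, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function name `enumerate_spans` and its `max_width` parameter are hypothetical, and whitespace tokens stand in for the model's tokenization.

```python
def enumerate_spans(tokens, max_width=10):
    """Return all (start, end) token index pairs, end inclusive, with width <= max_width."""
    spans = []
    for start in range(len(tokens)):
        # A span may not run past the end of the sentence.
        last = min(start + max_width, len(tokens))
        for end in range(start, last):
            spans.append((start, end))
    return spans


# Hypothetical example sentence fragment, tokenized on whitespace.
tokens = "a Palestinian youth threw a stone".split()
spans = enumerate_spans(tokens, max_width=3)
```

Each returned pair is a candidate entity; candidate relation tuples are later formed from the spans that survive span classification.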

First, different tokens in a span should contribute differently to the span representation, which we call
span-specific features. But existing methods treat all span tokens as equally important (Eberts and Ulges,
2019) or consider only the span head and tail tokens (Dixit and Al-Onaizan, 2019), ignoring these
significant features. Take the span "a Palestinian youth" in sentence 1 of Figure 1 as an example:
"youth" should contribute much more to the span representation than "a" and "Palestinian" when
classifying the span as "PER". Second, the local contextual information of relation tuples is either
omitted (Luan et al., 2018; Dixit and Al-Onaizan, 2019) or computed only by max pooling (Eberts
and Ulges, 2019) when performing relation classification, which does not sufficiently capture the
information it contains. Yet the local context may contain crucial information for predicting the
relations that relation tuples hold. A case study is shown in sentence 2 of Figure 1: "ownership" (in
red font) greatly helps to determine the relation ("PART-WHOLE") of the relation tuple (namely
Sentence 1: The army said troops fired, and hit a boy, after [a Palestinian youth]_PER threw a stone.

Sentence 2: The weak score is primarily the result of [Starbucks]_ORG current strategy of increasing equity ownership of [several foreign subsidiaries]_ORG.
Relation: <several foreign subsidiaries, Starbucks, PART-WHOLE>

Sentence 3: [Palestinians]_GPE claim [all of the West Bank and Gaza]_LOC for a state.
Relation: <all of the West Bank and Gaza, Palestinians, PART-WHOLE>


Figure 1: Sentence examples including gold entities and gold relation triples from the ACE2005 dataset,
where "PER", "ORG", etc. denote gold entity types, "PART-WHOLE" denotes the gold relation type,
bracketed texts are spans of gold entities, underlined texts are local contexts of gold relation tuples,
and "Relation" denotes a gold relation triple.


<several foreign subsidiaries, Starbucks>). Third, sentence-level contextual information is ignored in
both span and relation classification, although it may provide important complementary information for both.
An example is given in sentence 3 of Figure 1: "state" (in red font) benefits the relation classification
of the relation tuple (namely <all of the West Bank and Gaza, Palestinians>), yet "state" is contained
neither in the relation tuple nor in the local context (namely "claim"), but elsewhere in
sentence 3, i.e., at the sentence level.

To address the above issues, we introduce a span-based joint extraction model with attention-based
span-specific and contextual semantic representations. Specifically, 1) MLP attention is used to calculate the
span-specific semantic representation; 2) the attention-based sentence-level contextual semantic representation for
a span is calculated by taking the span-specific semantic representation as query and the sentence token sequence
semantic representations as key and value; 3) local and sentence-level contextual semantic
representations for a relation are obtained by attention between the relation tuple semantic representations
and the corresponding token sequence semantic representations. The advantage of this approach is that we
can capture the most useful information to build effective span and relation semantic representations.

We take BERT (Devlin et al., 2019) as the default backbone network and address the three problems
above. Moreover, we investigate the effects of Multi-Head attention (Vaswani et al., 2017), Dot-Product
attention (Luong et al., 2015), General attention (Luong et al., 2015) and Additive attention (Bahdanau et
al., 2015) in generating contextual semantic representations. Extensive experiments on three benchmark
datasets show that our model consistently outperforms previous systems. In addition, Multi-Head
attention consistently improves over the other attention variants.


2 Related Works 


Traditionally, pipelined entity and relation extraction is divided into two subtasks, namely entity
detection and relation classification. Various neural networks have been widely investigated for the two 
subtasks, such as RNNs (Huang et al., 2015; Ma and Hovy, 2016), CNNs (Limsopatham and Collier, 
2016) for entity detection, and RNNs (Zhang and Wang, 2015), CNNs (Zeng et al., 2014), Transformer 
(Verga et al., 2018; Wang et al., 2019) for relation classification. 

As discussed in §1, joint entity and relation extraction is typically formulated as a sequence tagging 
based task. Traditionally, table-filling methods have been widely explored (Miwa and Sasaki, 2014; 
Gupta et al., 2016), where token labels and relation labels fill the diagonal and off-diagonal of the ta- 
ble respectively. Recently, many works concentrate on leveraging deep neural networks to tackle this 
task, e.g., stacked bidirectional LSTM (Miwa and Bansal, 2016; Zheng et al., 2017), combination of 
bidirectional LSTM and CNN (Zhou et al., 2017), and combination of bidirectional LSTM and atten- 
tion mechanism (Chi et al., 2019; Nguyen and Verspoor, 2019). In addition, a novel machine reading 
comprehension based approach (Li et al., 2019) is proposed, which formulates the task as a Question & 
Answer task but still in sequence tagging based mode. 

Recently, span-based joint extraction methods have been investigated to tackle problems existing in
sequence tagging based methods, e.g., the inability to detect overlapping entities. Specifically, Dixit and
Al-Onaizan (2019) realize this method by obtaining span semantic representations through a BiLSTM over
concatenated ELMo, word and character embeddings, then sharing them in both span and relation
classification. Luan et al. (2018) obtain span semantic representations in generally the same way as Lee et al. (2017),
but reinforce them by introducing a coreference task. Following Luan et al. (2018), Luan et al. (2019) propose
DyGIE, which captures span interactions through a dynamically constructed span graph. Wadden et
al. (2019) deliver further performance increases over DyGIE by replacing the BiLSTM with BERT and
introduce DyGIE++. More recently, Eberts and Ulges (2019) propose SpERT, a simple but effective
span-based model that takes BERT as backbone and uses two FFNNs to classify spans and relations
respectively. Unlike previous works, SpERT dramatically reduces model training complexity by adopting
negative sampling.

Our work follows SpERT, but differs in its span-specific and contextual semantic representations.
Specifically, our model obtains these semantic representations with an attention mechanism. By calculating the
matching degree between target sequence semantic representations and source sequence semantic
representations, the attention mechanism obtains attention scores on the source sequence, which are in essence
weight scores: the more important the information, the higher its weight score. Classified by the
implementation of the score function, the attention mechanism has multiple variants, e.g., Content-Based
attention (Graves et al., 2014), Additive attention (Bahdanau et al., 2015), General attention (Luong et
al., 2015), Dot-Product attention (Luong et al., 2015) and Multi-Head attention (Vaswani et al., 2017).


3 Approach 


In the rest of the paper, we abbreviate "semantic representation" as "representation". Figure 2 gives an overview
of our model, which uses BERT as encoder following SpERT: we map word embeddings into BERT
embeddings using pre-trained Transformer blocks (Vaswani et al., 2017). Based on these representations,
we first calculate span representations and perform span classification and filtration (§3.1). Second, we form
relation tuples, calculate relation representations and perform relation classification and filtration (§3.2).
Third, we investigate the effects of multiple attention variants in generating contextual representations (§3.3).
Finally, we introduce the model training settings (§3.4).

We define a sentence and a span from that sentence for use in the rest of this section, where $t$ denotes a token and
subscripts (e.g., 1, 2, 3, ...) denote token indexes:


    sentence: $S = (t_1, t_2, t_3, \dots, t_n)$


    span: $s = (t_i, t_{i+1}, t_{i+2}, \dots, t_{i+j})$


3.1 Span Classification and Filtration 


We add a NoneEntity type to the pre-defined entity types (denoted as $\mathcal{T}$). Spans are classified as
NoneEntity if they do not hold any pre-defined entity type.

As Figure 2 shows, the span representation for classification is composed of four parts, namely a) the
concatenation of span head and tail representations, b) the span-specific representation, c) the sentence-level
contextual representation, and d) the span width embedding. We use $X_i$ to denote the BERT embedding of token $t_i$,
and the BERT embedding sequences of $S$ and $s$ are defined as follows, where $X_0$ denotes the BERT
embedding of [CLS]:

    $B_S = [X_0, X_1, X_2, X_3, \dots, X_n]$


    $B_s = [X_i, X_{i+1}, X_{i+2}, \dots, X_{i+j}]$


Concatenation of span head and tail representations. If a span is composed of more than one
token, we concatenate the BERT embeddings of the span head and tail tokens. Otherwise, we duplicate the BERT
embedding of the single token and concatenate the two copies. The concatenation result for span $s$ is:


    $H_s = [X_i; X_{i+j}]$    (1)
Figure 2: Our joint extraction model with attention-based span-specific and contextual representations.
1) MLP attention is utilized to obtain span-specific representation; 2) Sentence-level contextual repre- 
sentation for span is obtained by attention calculation between span-specific representation and sentence 
token embedding sequence; 3) Relation local and sentence-level contextual representations are calculated 
by relation tuple representation attending to corresponding token embedding sequences. 


Span-specific representation. Here we use MLP attention (Dixit and Al-Onaizan, 2019) to calculate the
span-specific representation. Taking span $s$ as an example:


    $V_k = \mathrm{MLP}_k(X_k) \quad \text{s.t. } k \in [i, i+j]$    (2a)

    $\alpha_k = \frac{\exp(V_k)}{\sum_{m=i}^{i+j} \exp(V_m)}$    (2b)

    $F_s = \sum_{m=i}^{i+j} \alpha_m X_m$    (2c)


where $V_k$ is a scalar; $\alpha_k$ is the attention weight of $X_k$, computed by the Softmax function; and $F_s$ is the
span-specific representation, obtained as the attention-weighted sum over $B_s$. In this way, we can evaluate
the significance of each span token: the more important a token, the larger its attention weight.
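
The weighting in Eqs. (2b) and (2c) can be illustrated with a small pure-Python sketch. It is a simplified illustration, not the authors' implementation: the scalar scores are supplied by hand in place of the MLP of Eq. (2a), and the function name and toy embeddings are hypothetical.

```python
import math


def span_specific_representation(span_embeddings, scores):
    """Softmax-weighted sum of token embeddings, following Eqs. (2b) and (2c).

    span_embeddings: token vectors X_i..X_{i+j}, each a list of floats.
    scores: one scalar per token, standing in for the MLP output of Eq. (2a).
    """
    # Subtracting the maximum keeps exp() stable and leaves the softmax unchanged.
    top = max(scores)
    exps = [math.exp(v - top) for v in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(span_embeddings[0])
    pooled = [sum(w * x[d] for w, x in zip(weights, span_embeddings)) for d in range(dim)]
    return weights, pooled


# Toy three-token span with 2-dimensional embeddings and made-up scores;
# the last token receives the highest score and therefore the largest weight.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [0.0, 0.0, 2.0]
alpha, F_s = span_specific_representation(X, V)
```

With these scores the third token dominates the pooled vector, which mirrors the intuition that "youth" should outweigh "a" and "Palestinian".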

Sentence-level contextual representation. Taking $F_s$ as query and $B_S$ as both key and value, the
sentence-level contextual representation for span $s$ is calculated as:


    $T_s = \mathrm{Attention}(F_s, B_S, B_S)$    (3)


Information beneficial for span classification is assigned a large weight, and the resulting contextual
representation becomes part of the span representation.
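
To make the query/key/value roles in Eq. (3) concrete, the sketch below implements attention for a single query vector. It is an assumption-laden simplification: it swaps in a single-head scaled dot-product score instead of the paper's default Multi-Head attention, it assumes the query and keys already share a dimension, and the function name and toy vectors are hypothetical.

```python
import math


def attention(query, keys, values):
    """Single-query scaled dot-product attention over a token sequence."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    # Softmax over the sequence positions (max subtracted for stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context


# Toy example: the span-specific vector is the query; three sentence token
# embeddings serve as both keys and values.
F_s = [1.0, 0.0]
B_S = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
weights, T_s = attention(F_s, B_S, B_S)
```

The same function shape applies to the relation-level representations, with the relation tuple representation as query and the local or sentence-level token embeddings as keys and values.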

Span width embedding. The span width embedding allows the model to incorporate a prior over the
span width. A fixed-size embedding for each span width 1, 2, ... (Lee et al., 2017) is learned during model
training. Thus, we can look up a width embedding $W_{j+1}$ for $s$ from the embedding matrix.

Span classification. The final span representation for classification is:


    $R_s = [T_s; F_s; H_s; W_{j+1}]$    (4)


$R_s$ first passes through a multi-layer FFNN and is then fed into a Softmax classifier, which yields a
posterior for $s$ over $\mathcal{T}$ (including NoneEntity):


    $y_s = \mathrm{Softmax}(\mathrm{FFNN}(R_s))$    (5)


Span filtration. Taking the highest-scoring class of $y_s$ gives the predicted entity type of $s$.
We keep only the spans that are not classified as NoneEntity, which form the predicted entity set $\mathcal{E}$. We then
perform relation classification only on relation tuples derived from $\{\mathcal{E} \otimes \mathcal{E}\}$, which reduces the search space; here
$\otimes$ denotes the Cartesian product.
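
Forming candidate relation tuples from the predicted entity set can be sketched with the standard library. This is an illustrative sketch, not the authors' code: spans are represented as hypothetical (start, end) index pairs, and pairs of a span with itself are dropped, in line with the constraint that the two spans of a relation tuple differ (§3.2).

```python
from itertools import product


def candidate_relation_tuples(predicted_entities):
    """Ordered pairs of distinct predicted spans: the Cartesian product minus (s, s)."""
    return [(s1, s2) for s1, s2 in product(predicted_entities, repeat=2) if s1 != s2]


# Spans as (start, end) token indices; the values are illustrative only.
entities = [(0, 0), (2, 7), (10, 10)]
pairs = candidate_relation_tuples(entities)
```

Both orders of each pair are kept, so a directed relation can be predicted in either direction. With n predicted entities this yields n(n-1) candidates, far fewer than pairing all enumerated spans.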


3.2 Relation Classification and Filtration 


We add a NoneRelation type to the pre-defined relation types (denoted as $\mathcal{Y}$). Let $s_1$, $s_2$ be two spans; the relation
tuples taken for relation classification are defined as:


    $\langle s_1, s_2 \rangle \in \{\mathcal{E} \otimes \mathcal{E}\} \quad \text{s.t. } s_1 \neq s_2$


As Figure 2 shows, the relation representation for classification is composed of three parts, namely a) the
concatenation of relation tuple representations, b) the local contextual representation, and c) the sentence-level
contextual representation.

Concatenation of relation tuple representations. Before concatenating $R_{s_1}$ and $R_{s_2}$, we first apply
a multi-layer FFNN to them to reduce their dimensions. The concatenation result is:


    $H_r = [\mathrm{FFNN}(R_{s_1}); \mathrm{FFNN}(R_{s_2})]$    (6)


Local contextual representation. Let $B_c$ denote the BERT embedding sequence of the local context
between $s_1$ and $s_2$:

    $B_c = (X_m, X_{m+1}, X_{m+2}, \dots, X_{m+n})$


The attention-based local contextual representation is calculated by taking $H_r$ as query and $B_c$ as both key and
value:

    $F_r = \mathrm{Attention}(H_r, B_c, B_c)$    (7)
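
Selecting the local context tokens between two spans can be sketched as below. The helper name and the (start, end) span encoding with an inclusive end are assumptions made for illustration; whitespace tokens stand in for BERT tokens, and how the paper treats adjacent or overlapping spans is not stated, so the sketch simply returns an empty context in that case.

```python
def local_context_indices(span1, span2):
    """Token indices strictly between two spans given as (start, end), end inclusive.

    Returns an empty list when the spans are adjacent or overlap.
    """
    first, second = sorted([span1, span2])
    return list(range(first[1] + 1, second[0]))


# Sentence 3 of Figure 1, tokenized on whitespace for illustration.
tokens = "Palestinians claim all of the West Bank and Gaza for a state".split()
palestinians = (0, 0)
west_bank_and_gaza = (2, 8)
context = [tokens[k] for k in local_context_indices(west_bank_and_gaza, palestinians)]
```

For this sentence the local context is the single token "claim", matching the example discussed in the introduction.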


Sentence-level contextual representation. The sentence-level contextual representation is calculated
by taking $H_r$ as query and $B_S$ as both key and value:


    $T_r = \mathrm{Attention}(H_r, B_S, B_S)$    (8)


Relation classification. Before $F_r$ and $T_r$ are used to build the relation representation, we first apply
two different multi-layer FFNNs to them to reduce their dimensions, aiming to keep them in a proper
proportion within the relation representation. The final relation representation for classification is:


    $R_r = [H_r; \mathrm{FFNN}_F(F_r); \mathrm{FFNN}_T(T_r)]$    (9)


Akin to span classification, $R_r$ first passes through a multi-layer FFNN and is then fed into a Softmax
classifier, which yields a posterior for $\langle s_1, s_2 \rangle$ over $\mathcal{Y}$ (including NoneRelation):


    $y_r = \mathrm{Softmax}(\mathrm{FFNN}(R_r))$    (10)


Relation filtration. Taking the highest-scoring class of $y_r$ gives the predicted relation type of
$\langle s_1, s_2 \rangle$. Only relation tuples that are not classified as NoneRelation are kept; together with their
predicted types, they form the predicted relation triples.
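
The filtration step, which is the same for spans and relations, amounts to an argmax followed by dropping the none class. The sketch below is hypothetical: the function name, the label list and the posterior values are made up for illustration.

```python
def filter_predictions(candidates, posteriors, labels, none_label):
    """Keep candidates whose highest-scoring class is not the none label."""
    kept = []
    for candidate, probs in zip(candidates, posteriors):
        best = max(range(len(labels)), key=lambda k: probs[k])
        if labels[best] != none_label:
            kept.append((candidate, labels[best]))
    return kept


labels = ["NoneRelation", "PART-WHOLE"]
candidates = [
    ("several foreign subsidiaries", "Starbucks"),
    ("Starbucks", "several foreign subsidiaries"),
]
# Made-up posteriors, one probability per label for each candidate tuple.
posteriors = [[0.1, 0.9], [0.8, 0.2]]
triples = filter_predictions(candidates, posteriors, labels, none_label="NoneRelation")
```

Only the first tuple survives here, yielding one predicted triple with type PART-WHOLE.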

3.3 Attention Variants 


In this paper, we investigate the effects of Multi-Head attention, Additive attention, Dot-Product attention
and General attention in generating contextual representations. Their score functions are shown below.


    Multi-Head attention:   $\mathrm{score} = \frac{Q \odot K}{\sqrt{d_k}}$

    Additive attention:     $\mathrm{score} = W \cdot Q + W \cdot K$

    Dot-Product attention:  $\mathrm{score} = W \cdot (Q \odot K)$

    General attention:      $\mathrm{score} = Q \cdot W \cdot K$


Here $Q$ and $K$ denote the query and key respectively; $W$ denotes a parameter matrix; $d_k$ denotes the dimension
of $K$; $\cdot$ and $\odot$ denote matrix broadcast and element-wise multiplication respectively. For Multi-Head
attention and Dot-Product attention, we first apply different multi-layer FFNNs to $Q$ and $K$ to
convert them to the same dimension.


3.4 Model Training 


Parameter matrices of FFNNs and attentions are learned, and BERT is fine-tuned during model training. 
The joint loss function of our model is defined as: 


    $\mathcal{L} = 0.4\,\mathcal{L}^{s} + 0.6\,\mathcal{L}^{r}$    (11)


where $\mathcal{L}^{s}$ denotes the cross-entropy loss of span classification and $\mathcal{L}^{r}$ denotes the binary cross-entropy
loss of relation classification. Because relation classification generally performs worse
than span classification, we assign a larger weight to $\mathcal{L}^{r}$ so that the model focuses more on relation
classification.
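
The weighted sum in Eq. (11) can be written out as a short sketch. The two loss helpers are textbook definitions supplied for illustration; their exact reduction (here a mean over relation classes) and the example probabilities are assumptions, not details taken from the paper.

```python
import math


def cross_entropy(probs, gold_index):
    """Negative log-likelihood of the gold class under a softmax posterior."""
    return -math.log(probs[gold_index])


def binary_cross_entropy(probs, gold_labels):
    """Mean binary cross-entropy over independent per-class probabilities."""
    terms = [-(y * math.log(p) + (1 - y) * math.log(1 - p)) for p, y in zip(probs, gold_labels)]
    return sum(terms) / len(terms)


def joint_loss(span_loss, relation_loss, span_weight=0.4, relation_weight=0.6):
    """Weighted sum of the two task losses, as in Eq. (11)."""
    return span_weight * span_loss + relation_weight * relation_loss


# Made-up posteriors for one span and one relation tuple.
span_loss = cross_entropy([0.7, 0.2, 0.1], gold_index=0)
relation_loss = binary_cross_entropy([0.9, 0.2], gold_labels=[1, 0])
loss = joint_loss(span_loss, relation_loss)
```

Keeping the weights as parameters makes it easy to compare the 0.4/0.6 setting against an unweighted sum.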

Negative sampling (Eberts and Ulges, 2019) is adopted during model training to improve model
performance and robustness. Unlike previous works, we adopt a dynamic sampling strategy, in which the
number of negative examples for both entities and relations is thirty times the number of ground-truth ones in each sentence.
With this strategy, our model keeps a much more balanced data distribution in the training data.
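
A minimal sketch of this dynamic strategy for spans is given below. The function name, the fixed random seed and the cap at the number of available negatives are assumptions for illustration; in particular, how the paper handles sentences without gold entities is not stated, and this sketch simply draws no negatives in that case.

```python
import random


def sample_negative_spans(candidate_spans, gold_spans, ratio=30, seed=0):
    """Draw up to ratio * len(gold_spans) non-gold spans as NoneEntity examples."""
    gold = set(gold_spans)
    negatives = [s for s in candidate_spans if s not in gold]
    # Never ask for more negatives than the sentence can supply.
    budget = min(len(negatives), ratio * len(gold))
    return random.Random(seed).sample(negatives, budget)


# All spans of width <= 3 in a 40-token sentence, with one gold entity.
candidates = [(i, j) for i in range(40) for j in range(i, min(i + 3, 40))]
gold = [(2, 3)]
negatives = sample_negative_spans(candidates, gold)
```

Because the budget scales with the number of gold items, the negative-to-positive ratio stays the same across sentences, unlike a fixed per-sentence count.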


4 Experiments 


4.1 Datasets 


We test our model on ACE2005, CoNLL2004 and ADE, which are referred to as ACE05, CoNLL04 and
ADE respectively in the rest of the paper.


• ACE05 (Doddington et al., 2004): the English dataset is composed of multi-domain news articles, e.g.,
broadcast, newswire and weblog. Seven coarse-grained entity types and six coarse-grained relation
types are pre-defined. We follow the training/dev/test split in (Li and Ji, 2014; Li et al., 2019). It has
351 documents for training, 80 for development and 80 for test, of which 437 contain overlapping
entities.


• CoNLL04 (Roth and Yih, 2004) is composed of news articles from outlets such as WSJ and AP. We
follow the training/dev/test split in (Adel and Schütze, 2017; Bekoulis et al., 2018), which consists of
910 articles for training, 243 for development and 288 for test.


• ADE (Gurulingappa et al., 2012) aims to extract drug-related adverse effects from medical text,
including two pre-defined entity types (namely Adverse-Effect and Drug) and a single relation type,
i.e., Adverse-Effect. It consists of 4272 sentences, of which 1695 contain overlapping entities.
We conduct 10-fold cross validation following (Bekoulis et al., 2018; Eberts and Ulges, 2019).


For ACE05, following (Li et al., 2019; Luan et al., 2019), an entity is considered correct if we can 
identify its head region and type correctly. A relation is considered correct if we can identify its argument 
entities and type correctly. For CoNLL04 and ADE, following (Li et al., 2019; Eberts and Ulges, 2019),
we treat an entity as correct when its type and entity region match ground truth, and treat a relation as 
correct when its type and argument entities match ground truth. 
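
The exact-match criterion used for CoNLL04 and ADE can be sketched as a set comparison. The function name, the (start, end, type) tuple encoding and the example predictions are assumptions for illustration; the ACE05 head-region criterion would require a different matching key and is not covered here.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match scores over sets of (start, end, type) tuples or relation triples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up gold and predicted entities; the second prediction has the right
# region but the wrong type, so it does not count as correct.
gold_entities = {(0, 0, "Drug"), (4, 5, "Adverse-Effect")}
predicted_entities = {(0, 0, "Drug"), (4, 5, "Drug"), (7, 7, "Adverse-Effect")}
p, r, f1 = precision_recall_f1(predicted_entities, gold_entities)
```

The same function scores relations if each item is a tuple of the two argument entities and the relation type.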


4.2 Experimental Settings 


We build our model upon the English cased version of BERT_BASE. We set the negative sampling rate
to 30, the batch size for model training to 8, dropout to 0.2 and the width embedding dimension to 50. For
Multi-Head attention, the number of heads is set to 8. $\mathrm{FFNN}_F$ and $\mathrm{FFNN}_T$ contain three fully connected
layers, and all the other FFNNs contain two layers. The number of training epochs differs across datasets.
For all datasets, the span width threshold is set to 10. In our model, we adopt the weighted loss
shown in Eq. (11), and follow Eberts and Ulges (2019) for the other hyperparameter settings.


4.3 Baseline Models 


We compare our model with the following models. 


• DyGIE (Luan et al., 2019) is the current span-based state-of-the-art model on ACE05. It reinforces
span and relation representations by introducing a coreference task.


• Multi-turn QA (Li et al., 2019) is the current sequence tagging based state-of-the-art model on
ACE05 and CoNLL04. It formulates joint entity and relation extraction as a multi-turn question and
answer task, but still in sequence tagging based mode.


• SpERT (Eberts and Ulges, 2019) is the current span-based state-of-the-art model on ADE and
CoNLL04. Our work follows this model but adopts attention-based span-specific and contextual
representations.


• Relation-Metric (Tran and Kavuluru, 2019) is a sequence tagging based model in a multi-task
learning scheme. It reports performance on ADE and CoNLL04, and achieves state-of-the-art results on ADE.


Dataset   Method              Entity                   Relation
                              P      R      F1         P      R      F1

ACE05     Multi-turn QA †     84.7   84.9   84.8       64.8   56.2   60.2
          DyGIE ‡             -      -      88.4       -      -      63.2
          SPAN_Multi-Head     89.32  89.86  89.59      71.22  60.19  65.24

CoNLL04   Multi-turn QA †     89.0   86.6   87.8       69.2   68.2   68.9
          SpERT ‡             88.25  89.64  88.94      73.04  70.00  71.47
          SPAN_Multi-Head     90.11  90.36  90.23      76.96  71.88  74.33

ADE       Relation-Metric †   86.16  88.08  87.11      77.36  71.25  77.29
          SpERT ‡             88.99  89.59  89.28      77.77  79.96  78.84
          SPAN_Multi-Head     89.88  91.32  90.59      79.56  81.93  80.73


Table 1: Results of different models on the ACE05, CoNLL04 and ADE test sets. SPAN_Multi-Head
outperforms previous SOTA models by +1.19, +1.29, +1.31 in entity recognition, and +2.04, +2.86, +1.89
in relation extraction. († previous sequence tagging based SOTA; ‡ previous span-based SOTA)


Method              ACE05               CoNLL04             ADE
                    Entity   Relation   Entity   Relation   Entity   Relation
                    (F1)     (F1)       (F1)     (F1)       (F1)     (F1)
SPAN_Multi-Head     89.59    65.24      90.23    74.33      90.59    80.73
SPAN_Dot-Product    87.94    62.88      88.23    70.89      88.15    77.31
SPAN_General        88.66    63.56      88.96    73.48      89.93    80.14
SPAN_Additive       89.07    64.53      89.17    71.36      89.68    79.75


Table 2: Performance comparison of attention variants. Multi-Head attention consistently outperforms the
others, with an absolute F1 increase of up to 2.44 (ADE) in entity recognition and 3.44 (CoNLL04) in
relation extraction.


4.4 Main Results 


We compare our model with both sequence tagging based methods and span-based methods on the three
benchmark datasets, and show the results in Table 1. We denote our method as SPAN_Multi-Head, which
means that we use Multi-Head attention to calculate contextual representations. Following prior works,
we report micro-averaged metrics for ACE05 and CoNLL04, and macro-averaged metrics for ADE.
For ACE05 and ADE, all the reported performances take overlapping
entities into consideration.
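
The difference between the two averaging schemes can be seen in a small sketch. The helper names and the per-type counts are made up for illustration and do not come from the reported experiments.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


def micro_f1(counts):
    """Pool the counts over all types, then compute a single F1."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1_from_counts(tp, fp, fn)


def macro_f1(counts):
    """Compute F1 per type, then take the unweighted mean."""
    return sum(f1_from_counts(*c) for c in counts.values()) / len(counts)


# Made-up (tp, fp, fn) counts per entity type.
counts = {"Drug": (90, 10, 10), "Adverse-Effect": (10, 10, 30)}
micro = micro_f1(counts)
macro = macro_f1(counts)
```

With these counts the micro average is dominated by the frequent type, while the macro average gives the rare, poorly predicted type equal weight and is therefore lower.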


SPAN_Multi-Head consistently outperforms both sequence tagging based and span-based SOTA
methods on the three benchmark datasets. Compared to SpERT, SPAN_Multi-Head improves F1 in
entity recognition by 1.29 (CoNLL04) and 1.31 (ADE), and improves it even more in relation
extraction, by 2.86 (CoNLL04) and 1.89 (ADE). We attribute these performance increases to the effective
span-specific representation and contextual representations. Moreover, SPAN_Multi-Head delivers solid
performance increases over DyGIE, by 1.19 in entity recognition and 2.04 in relation extraction on
ACE05. It is worth noting that DyGIE adopts a multi-task learning scheme, reinforcing span
representations by introducing a coreference task, which is absent in our method.


Besides Multi-Head attention, we investigate Dot-Product attention, General attention and Additive
attention in our method. Table 2 shows the performance of these attention variants on the three benchmark
datasets. SPAN_Multi-Head consistently outperforms the other three methods. One possible reason
is that in Multi-Head attention, the eight attention heads attend to different contextual information and
learn features from different representation spaces. Thus Multi-Head attention based contextual
representations can better complement span and relation representations.


Method              Entity (F1)   Relation (F1)
SPAN_Multi-Head     88.10         62.13
-SpanSpecific       86.78         60.21
-SentenceLevel      87.57         61.12
base                85.80         59.00

Table 3: Ablations on the ACE05 dev set with different span-specific and span sentence-level
contextual representation settings.


Method              Entity (F1)   Relation (F1)
SPAN_Multi-Head     88.10         62.13
-local              87.96         60.56
-SentenceLevel      88.21         61.77
base                87.91         59.66

Table 4: Ablations on the ACE05 dev set with different relation local and sentence-level
contextual representation settings.


5 Ablation Study 


Based on SPAN_Multi-Head, we conduct ablations on the ACE2005 dev set to analyze the effects of
different model components.


5.1 Span-specific and Sentence-level Contextual Representations for Span 


Table 3 shows the effects of the span-specific representation and the sentence-level contextual representation for
spans in our model, where -SpanSpecific denotes ablating the span-specific representation by replacing
$[F_s; H_s]$ in Eq. (4) with the max pooling of $B_s$; -SentenceLevel denotes ablating the sentence-level
contextual representation by replacing $T_s$ in Eq. (4) with the BERT embedding of [CLS] generated
on $B_S$; and base is the model with both ablations applied, which is the default span representation
setting in SpERT. For ACE05, we observe that both the span-specific representation and the sentence-level
contextual representation are helpful for both entity recognition and relation extraction. This is because
span representations are shared between the two subtasks.


5.2 Local and Sentence-level Contextual Representations for Relation 


Table 4 shows the effects of the local and sentence-level contextual representations for relations in our model,
where -local denotes ablating the local representation by replacing $F_r$ in Eq. (9) with the max
pooling of $B_c$; -SentenceLevel denotes ablating the sentence-level contextual representation by removing
$\mathrm{FFNN}_T(T_r)$ from Eq. (9); and base is the model with both ablations applied, which is the default
relation representation setting in SpERT. For ACE05, we observe that both the local and sentence-level
contextual representations clearly benefit relation extraction, while having negligible influence on entity
recognition. A plausible explanation is that these representations directly form part of the relation
representation, while they affect span representations only through backpropagation.


It is worth noting that the local contextual representation has a greater impact on relation extraction
than the sentence-level one. One reason is that the information determining the relation type mainly
resides in the relation tuple and its local context. Another reason is that, as complementary information,
the sentence-level contextual representation occupies a relatively small proportion of the relation representation,
so as to avoid introducing noise into it.


6 Conclusion 


We introduce attention-based methods for generating semantic representations in span-based joint entity and
relation extraction. We apply MLP attention to capture span-specific features and obtain
semantically rich span representations, and calculate task-specific contextual representations with an attention
architecture to further reinforce span and relation representations. Our approach consistently outperforms both
the sequence tagging based and span-based SOTA methods on three benchmark datasets, setting new
state-of-the-art results. As future work, we would like to further improve relation classification
performance by reducing span classification errors. We also plan to explore more advanced methods
for encoding effective span and relation representations.